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Abstract 

The SiZer method is extended to nonparametric hazard estimation and also to 
censored density and hazard estimation. The new method aUows quick, visual sta- 
tistical inference about the important issue of statistically significant increases and 
decreases in the smooth curve estimate. This extension has required the opening of 
a new avenue of research on the interface between statistical inference and scale space. 

KEY WORDS: Scale space, Smoothing, Visual inference. 

1 Introduction 

Nonparametric hazard rate estimation is a standard tool in survival analysis, dating back 
at least to Watson and Leadbetter (1964a, b) and Rice and Rosenblatt (1976). 

For practical use, a critical issue is understanding where the hazard rate curve increases 
and where it decreases. A confounding issue is the bandwidth, i.e. the window width 
or smoothing parameter. SiZer addresses both of these problems, in the context of 
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nonparametric density and regression estimation, by combining a scale space approach to 
smoothing with a useful visualization of simultaneous statistical inference. 

The extension of SiZer developed in this paper is not straightforward. The obvious idea 
of simply plugging reweighted data into SiZer gives invalid statistical inference. Hence, a 
careful statistical accounting for the reweighting is developed. Some examples are shown 
in Section [2j Mathematical development is given in Section [3l Computational issues are 
discussed in Section SI 

It is straightforward to simultaneously develop these ideas for (i) hazard estimation; 
(ii) censored density estimation; (iii) censored hazard estimation. This is because all three 
of these cases fit very simply into a general form of estimator, using an elegant common 
notation, perhaps first published by Patil (1990, 1993). Hence all three cases are treated 
simultaneously here. For reasons of presentation, various aspects of this paper are usually 
illustrated by focusing on just one of the three cases first. 

Some other important related references include Tanner and Wong (1983), Marron 
and Padgett (1987), Lo, Mack and Wang (1989), Sarda and Vieu (1991), Miiller and 
Wang (1994), Gonzalez-Manteiga, Marron, and Cao (1996), Kousassi and Singh (1997), 
Stute (1999), Hess, Serachitopol and Brown (1999), and Jiang and Doksum (2003). 

Many readers perhaps understand that when data are censored, it is important to 
properly adjust for the bias that this creates, using Kaplan-Meier weights. An example, 
which shows that this bias can seriously impact SiZer inference, can be found in Section 
1.1 of Jiang and Marron (2003). Kaplan- Meier weights, and there application in curve 
estimation, can be thought of in several ways. An insightful view is to consider the data 
in time order. All uncensored observations, that appear before any censored observations, 
receive equal weight. After the first censored observation appears, later uncensored ob- 
servations receive more weight, etc. The intuitive content of the resulting Kaplan Meier 
weights is that the later uncensored observations, can be thought of as representing both 
themselves and also a fraction of the earlier censored observations in the data set. An 
example, which illustrates how letting one data point represent several can seriously im- 



2 



pact SiZer inference, is given in Section 1.2 of Jiang and Marron (2003), in the somewhat 
different context of data rounding. See Figure 5 of Chaudhuri and Marron (1999) for 
another ihustration of the data rounding phenomenon. If re-weighted data are simply 
plugged into SiZer, then treating the later uncensored observations as several data points, 
will affect the SiZer map in the same way, creating invalid inference. This problem is 
addressed by revising the SiZer inference, through a careful variance calculation, which 
results in correct inference. 

2 Examples 

Figure [1] shows a censored SiZer analysis of the Stanford Heart Transplant Data, from 
Kalbfleisch and Prentice (1980). The data consisting of 103 observations, originally from 
Crowley and Hu (1977) are the survival times (in days) of potential heart transplant 
recipients from their date of acceptance into the transplant program. There is censoring 
since some patients were lost to follow up before they died and since some patients were 
still alive on the closing date of the study. 

Analysis from the point of view of density estimation is shown in Figures [1^ and 
[TJ;. This shows that there are many deaths very soon after transplantation, and a long 
decreasing tail. Because of the relatively poor way in which the kernel density estimator 
handles boundaries, see e.g. Figure 2.16 of Wand and Jones (1995), at larger scales the 
estimates first increase at the left edge. SiZer shows that both the overall decrease (the 
large red region) is statistically significant, and so are the boundary effects (the thin blue 
region right at the edge). 

For these data there is more interest in analysis of the hazard rate, as done in Figures 
[T]3 and [T]i. The hazard rate is carefully defined in Section [21 but the intuitive idea is 
the instantaneous rate at which patients die. The estimate is a reweighting of the kernel 
density estimate, as can be seen from the fact that the small scale spikes in Figure [Da 
are simply magnifications of those in Figure [T^, but the scale is more appropriate for 
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survival considerations. A central question is: when is the hazard rate increasing, and 
when is it decreasing? The color scheme of SiZer is well suited to address this issue. 
Furthermore, this question is much more directly answered by the SiZer approach, than 
by more conventional confidence intervals. The red in the middle left of the SiZer map 
in Figure [TJl shows that the hazard rate significantly decreases during that time period, 
i.e. as transplants "settle in" , chances of survival increase. The red near the top on the 
right shows that there is also a longer term improvement of the chances of survival after 
one has survived for a substantial period. 

These findings are consistent with those of Jiang and Doksum (2003). An inconsis- 
tency is the blue region at the left end. As noted above this is due to poor boundary 
behavior of the kernel density estimator that underlies this inference. The local polyno- 
mial estimator developed by Jiang and Doksum (2003) avoids this problem, which is why 
their hazard estimate is mostly monotonically decreasing. An interesting open problem 
is to adapt SiZer ideas to the Jiang and Doksum local polynomial hazard estimator. 

Figure [2] shows a SiZer analysis of the device lifetime data of Aarset(1987). These 
data are uncensored and consist of 102 observations. 

The density estimates in Figure [2^ suggest a "U-shape" density. However the SiZer 
map in Figure [2]: flags only the right hand peak as statistically significant. This is likely 
due to the same inefficiency of the kernel density estimator near the boundary. 

Of more interest for these data is the hazard rate analysis shown in Figures [2)3 and[2li. 
The dominant color in the SiZer map is blue which shows that the hazard rate generally 
increases over time, which is consistent with the expected wearing over time of mechanical 
components. A disappointing feature of the family of hazard estimates in Figure [2}3, is 
that there is a spike only on the right side, while other analyses, including Aarset(1987) 
and Mudholkar, Srivastava and Kollia (1996) find a "bathtub" shape, that includes a 
spike on the left as well. This again is because of the poor boundary behavior of the 
kernel density estimator. This problem could also be addressed by a version of hazard 
estimation SiZer that is based on the local polynomial method of Jiang and Doksum 
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Figure 1: Days to death after heart transplant. Jitter plot, and family of censored density 
estimates in Figured^, with corresponding SiZer map in Figure [Tfc. Family of censored 
hazard estimates in Figure [TJd, with corresponding SiZer map in Figure [Hi. 



(2003). 

Both of these examples illustrate an important property of SiZer: it provides a gener- 
ally good big picture assessment for initial exploratory purposes. However, for addressing 
any specific problem, e.g. the boundary questions brought up in Figures [1] and [21 it may 
not be as effective as a method that specifically targets that issue (although we do not 
know of a currently implemented method that gives better statistical inference of this type 
at the boundary). Hence we propose SiZer as a broad based method for initially finding 
structure in data (and for the perhaps more important task of quickly understanding what 
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Figure 2: Device lifetime data. Family of censored density estimates in Figure [2^, with 
corresponding SiZer map in Figure [2j:. Family of censored hazard estimates in Figure 
[2)3, with corresponding SiZer map in Figure [2li. 

structures are mere sample artifacts). After one has an idea about what to look for, then 
other methods can provide deeper insights. Often the next useful step is modelling, e.g. 
as done by Mudholkar, Srivastava and Kollia (1996) for the device lifetime data. 

3 Mathematical Development 

Our extension of SiZer is most transparently explained in the context of hazard rate 
estimation. Hence this is developed in Section 13. 1[ Then the extension to censored 
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density and censored hazard estimation is done in Section 13. 2i 



3.1 Hazard Rate Mathematics 

For data Xi, independent, identically distributed with cumulative distribution func- 
tion F (x), and probability density / (x) = F' (x), the maximum likelihood estimate of F 
is the empirical cumulative distribution function Fn{x). The kernel density estimate of / 
is 

n 

fh{x)=n-'J2^hix-Xi), 
1=1 

where (•) = j^K (^) , for the kernel function K and the bandwidth h. For / supported 
on (0, oo) the hazard rate is A (x) = / (x) /[I — F (x)], and its cumulative is A (x) = 
Jq a (n) du. Watson and Leadbetter (1964a) showed, but see for example Proposition 1 
of Shorack and Wellner (1986) for a much different way to arrive at the same conclusion, 
that a natural estimate of the hazard rate is 

T / \ -1 {x — Xi) 



i=l 



_ A Kh (x - 

^-^ 11 — i ' 

i=l 

where the X(j) are the order statistics, with X(^i~^ < ■ ■ ■ < 

Derivatives are estimated by differentiation. The density derivative, /' (x), is esti- 
mated by 

n 

g{x)=n-'Y.Khi^-^^)^ (2) 
1=1 

and the hazard rate derivative. A' (x), is estimated by 
where 

<(.) = ^A-.W = J,/r(i)^ 
The variance of the density derivative estimate is: 

var (^fl (^)) = ^'^^ ^n~^ K'^ (x — Xi)^ = n^^var i^K'^ (x — Xi)) . 



Denote by ^^(yi, . . . , y„) = n"! Er=i vl " (^'' ELi Vi? and iT^,, = i^U^ " ^i) • Then 
the variance factor above is estimated by the sample variance 

n / " \ 



n 



i=l \ i=l 

n „ 
w \2 (7) 



Using the approximation F„ (x) ~ -P (x) , the variance of the derivative hazard rate is 
approximated by 



, ,\ / K', (x — Xi) 

var [Xy^ (x) j ^ var In 2^ ^ - 



n ^var 



Except for the fact that F is unknown the variance factor here could be estimated by the 
sample variance 

where for any cumulative distribution function H (x), dependence on x and /i is suppressed 
in the notation 

_ K',^{x-X,) 
Applying Fn{x) ~ F{x), we get the approximation 

^'(^F„,l'-'^k,n)- (6) 

This is an important point where there is a critical difference between this development, 
and simply using the reweighted data in ordinary SiZer. In particular, the variance factor 
([6]), now appropriately uses the weights. Thus, an isolated point with a heavy weight 
is no longer flagged as significant, because the variance estimate also increases when the 
weights are heavier. 

SiZer gets its "simultaneous inference" properties (i.e. it addresses the multiple com- 
parison problem) using a "number of independent blocks" calculation done in Section 3 
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of Chaudhuri and Marron (1999). The basis of this is the Effective Sample Size: 

ESSU-) = ^='^l[l~^'\ (7) 

which measures the "number of points in each kernel window" (this is exactly true if K 
is the uniform density window). Correct adaptation to the hazard context requires yet 
another careful twist. Naive reweighting would suggest that denominators of 1 — F (Xj) 
should be inserted. But the independent blocks calculation is based on the number of 
independent pieces of information, so instead the formula ([7]) should be retained in the 
same form. 

Thus a hazard rate version of SiZer comes from modifying the density estimation 
version, replacing the terms K'j_^ (x — Xi) in ([2|) by K'f^ {x — Xi) /[I— F„ (-'^j)], and replacing 
the variables ^ in ([3]) by Kp^ ■. 

3.2 Censored Estimation Mathematics 

Censored data comes in the form {Xi,6i) , (X„, 5„), where Xi = min (Tj, Cj) and di = 
1 (Ti < Ci), where l(^) is the indicator of event A which equals 1 if A happens and 
otherwise. 

Assume that Ti,...,Tn are independent, identically distributed with cumulative dis- 
tribution function F{x) and that Ci, C„ are independent (and independent of the Tj), 
identically distributed with cumulative distribution function G{x). Note that the cumu- 
lative distribution function of Xi, is L{x), where the corresponding cumulative survival 
function is L{x) = F (x) G (x), using the notation H (x) = 1 — H (x), for any cumulative 
distribution function H{x). 

The goal is estimation of the survival probability density / (x) = F' (x) and the 
corresponding hazard rate A(x). 

The cumulative distribution functions F and G can be estimated by the Kaplan Meier 
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(1958), i.e. Product Limit, estimators given by F„ and G„ respectively, where 



Fn 



hi) 



1~ n (71^) ' if^<^(n) 



0, if X > 



in) 



and Gn is defined similarly to F„ but with (5(j) replaced by 1 — and the 
are the order statistics version of the data with X(^i-^ < • • • < -^(n)- 
A natural kernel density estimate of / is 

For / supported on (0, 00) the corresponding estimate of the hazard rate is 



I 

n 



"[ Gn {Xi) Fn {Xi 



n — I 
1=1 



Note that these have a structure very similar to the hazard rate estimator ([T|), which is 
why it is straight forward to extend SiZer to these well. 

Derivatives are again estimated by differentiation. The density derivative, f (x), is 
estimated by 

The hazard rate derivative. A' (x), is estimated by 

The variance of the density derivative estimate is: 

(7,1. \ ( _,^6^KjJx-JQ\ , f 5,K'^{x-X,) \ 
var thix)] = var n > — = = n var — = , 

^""^ " V t", Gn{Xi) i V Gn{Xi) J 



and for the hazard rate 
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Using the approximation methods leading to ([6]), these variance factors are estimated by 

sHSlK'G^,l,-,SnK'G„,n), (8) 
using again the notation ([S]), and by 

5'('Jli^L,l'-''5"^L,n) (9) 

respectively. 

The Effective Sample Size follows in a similar spirit. Again the basis is the number 
of independent pieces of uncensored data, resulting in the formula 

mo) • 

Thus the censored density and censored hazard rate version of SiZer come from 
modifying the density estimation version, replacing the terms K'^ (x — Xj) in ([5]) by 
6iK'^ {x — Xi) /Gn (Xi) and SiK'^^ (x — Xi) / Ln (Xi) respectively, and by replacing the vari- 
ables i^Q j in ([3]) by K'q^ ■ and K'^^ ■ respectively. 

4 Fast Computation 

Because SiZer relies on a large number of smooths, it is important to use a fast compu- 
tational method. Several such are discussed by Fan and Marron (1994). The binned 
approach is especially well suited to SiZer. 

Details of the binned implementation of ff^{x) are similar to those given in Chaudhuri 
and Marron (1999), which are based on those of Fan and Marron (1994), except that the 
kernels are now divided by appropriate cumulative distribution functions. In particular, 
for the equally spaced grid of points {xj : j = 1, ■■■,§}, let the corresponding bin counts 
(computed by some method, we have always used the "linear binning" described in Fan 
and Marron (1994)) be {cqj : j = 1, ...,5}. Then for density SiZer 
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where -S'o(xj) = Ylj'=i i^j-j'^^J' ^-nd = K'f^{xj — xj'). The approximated standard 

deviation of f^{xj), is 



sd{xj) = n 



n 



The censored and hazard versions of SiZer require reconsideration of the hnear binning 
algorithm. When a data point Xi is between grid points Xj and Xj+i, hnear binning assigns 
weight 



Xj 

Xj^l — Xj 



to the bin centered at Xj, and weight 



VJi,j+l 



Xj+i — Xi 

— Xj 



to the bin centered at Xj+i, and weight to ah other bins. These result in bin counts 



i=l 



For a generic estimated cumulative distribution function Hn, these bin counts are replaced 

by 



This results in the binned approximation to the generic estimator: 

_i A 5iK'^ (x - Xi 



n 



I Hn {Xi 



where 



^HniXj) = Y.Kj-j'CH„,j', 
j'=l 



To similarly approximate sd, use 

sd{xj) = nT^^"^ 



n 



i'=i \ HJj / y 



2 

,3' I ' 
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where the factor of 




) 



in the second moment term gives the second factor of ^ 



n 



that appears in the second moment. Finally, the binned version of the Effective Sample 
Size needs to be based on the unadjusted bin counts of the uncensored data 



where cqj = Yll=i ^ijSi and Hj-j' = Kh{xj — xj'). 

5 Computational details 

The results in this paper were obtained using Matlab. The codes are available from Steve 
Marron at 

http: / / www.stat.unc.edu/postscript/papers/marron/software/ 

6 Discuss 

We extend the SiZer to censored density and hazard rate estimation. This extension 
requires a statistical accounting for the reweighting. Our censored density estimation 
method is obviously applicable to un-censored cases. The SiZer of density and hazard rate 
estimate allows one to make quick, visual statistical inferences on the issue of statistically 
significant increases and decreases in the smooth density and hazard rate estimate. 
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